Voxtral Realtime: enable CUDA backend with int4 quantization #17798
mergennachin merged 1 commit into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17798
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 1 Cancelled Job as of commit e5c3690 with merge base 0907294.
NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Force-pushed 1e5399a to afe08f0
Force-pushed afe08f0 to 50e3a3d
Force-pushed 50e3a3d to e708015
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
examples/models/voxtral_realtime/voxtral_realtime_runner.cpp:612
`logits_to_token()` recreates/reseeds a `Sampler` on every decode step (seeded from `std::time(nullptr)`), so `temperature > 0` sampling won't have a stable RNG stream across tokens and can become repetitive. Since `StreamingSession` already has a `sampler_` member, it would be better to use that persistent sampler (with dtype switching for Float/BFloat16/Half) instead of calling `logits_to_token()` each step.
```cpp
    prev_token_, static_cast<uint64_t>(next_token));
if (piece.ok()) {
  token_cb_(*piece);
}
```
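To illustrate the RNG concern above: `std::time(nullptr)` has one-second resolution, so consecutive decode steps often reseed with the same value, and a freshly constructed sampler then repeats its first draw. The sketch below is a hypothetical toy, not the ExecuTorch `Sampler` API, contrasting per-step reseeding with a persistent sampler owned by the session.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Toy stand-in for a sampler; NOT the ExecuTorch Sampler API.
struct ToySampler {
  explicit ToySampler(uint64_t seed) : rng_(seed) {}
  uint32_t sample(uint32_t vocab_size) {
    std::uniform_int_distribution<uint32_t> dist(0, vocab_size - 1);
    return dist(rng_);
  }
  std::mt19937_64 rng_;
};

// Per-step reseeding: a fresh sampler with the same seed repeats its draw,
// which is what happens when std::time(nullptr) does not advance between steps.
inline uint32_t sample_reseeded(uint64_t seed, uint32_t vocab) {
  return ToySampler(seed).sample(vocab);
}
```

A persistent `ToySampler` member, by contrast, keeps a single evolving RNG stream: each call to `sample()` advances `rng_`, so draws differ across steps even when the wall clock does not move.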
```diff
 python -m executorch.examples.models.voxtral_realtime.export_voxtral_rt \
   --model-path "$LOCAL_MODEL_DIR" \
   --backend "$DEVICE" \
   ${STREAMING_ARG} \
   --output-dir "${OUTPUT_DIR}" \
-  ${VR_QUANT_ARGS}
+  ${VR_QUANT_ARGS} \
+  ${VR_DTYPE_ARGS}
```
In the voxtral_realtime export path, the script doesn't validate that the CUDA delegate data file (`aoti_cuda_blob.ptd`) was produced. Since the runner requires `--data_path` for CUDA, it'd be safer to add a `test -f "${OUTPUT_DIR}/aoti_cuda_blob.ptd"` check when `DEVICE=cuda` (similar to the Parakeet branch) so export failures are caught immediately.
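A minimal sketch of the suggested check, assuming the script's existing `DEVICE` and `OUTPUT_DIR` variables (the `touch` stands in for a real export producing the blob):

```shell
DEVICE="${DEVICE:-cuda}"
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/vr_export_check}"
mkdir -p "${OUTPUT_DIR}"
touch "${OUTPUT_DIR}/aoti_cuda_blob.ptd"   # stand-in for a real export

# Fail fast if the CUDA export did not produce the delegate data file.
if [ "$DEVICE" = "cuda" ] && [ ! -f "${OUTPUT_DIR}/aoti_cuda_blob.ptd" ]; then
  echo "Error: CUDA export did not produce aoti_cuda_blob.ptd" >&2
  exit 1
fi
echo "delegate blob present"
```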
Force-pushed e708015 to 05f0ed2
```
|-----------|---|---|----------------------------------------------|
| `xnnpack` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w`                 |
| `metal`   | ✓ | ✓ | none (fp32) or `fpa4w` (Metal-specific 4-bit) |
| `cuda`    | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w`                 |
```
Does Cuda support 8da4w/8da8w?
Related, I'm pretty sure xnnpack does not support 4w/8w.
> Does Cuda support 8da4w/8da8w?

Good catch, will fix.

> Related, I'm pretty sure xnnpack does not support 4w/8w.

xnnpack supports per-channel 4w and 8w. For example, we use 8w for token embeddings.
ET's embedding CPU op supports weight-only schemes, but I don't think xnnpack supports weight-only quantization for linear layers.
With that said, 4w/8da4w and 8w/8da8w quantize the weight data the same way. The only difference is that the 8da variants add fake activation quantization in front.
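To make the distinction concrete, here is a hedged sketch (toy code, not the ExecuTorch or torchao implementation) of per-channel int8 weight quantization, used the same way by `8w` and `8da8w`, where the `8da` variant only adds per-token fake quantization of activations before the matmul:

```python
import torch
import torch.nn.functional as F

def quantize_per_channel_int8(w: torch.Tensor):
    # Symmetric per-channel (per output row) int8 quantization of weights.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def linear_8w(x, q, scale, bias=None):
    # Weight-only: dequantize weights, compute in float.
    return F.linear(x, q.float() * scale, bias)

def linear_8da8w(x, q, scale, bias=None):
    # Dynamic activation quantization: fake-quantize activations per token
    # first; the weight handling is identical to the weight-only path.
    a_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_q = torch.clamp(torch.round(x / a_scale), -128, 127) * a_scale
    return F.linear(x_q, q.float() * scale, bias)
```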
@manuelcandales is there any plan for Metal AOTI to use int4/int8 for a more uniform experience?
The kernel should support it, because I'm using int4/int8 with MLX.
```
  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
  --backend cuda \
  --dtype bf16 \
  --streaming \
```
if this is supported, then why not test it in CI?
```
fi
source .ci/scripts/export_model_artifact.sh cuda "${{ matrix.model.repo }}/${{ matrix.model.name }}" "${{ matrix.quant }}" "${RUNNER_ARTIFACT_DIR}"
# Voxtral Realtime uses offline mode for CUDA CI (not streaming)
```
Add CUDA/AOTI backend support for the Voxtral Realtime model alongside the existing XNNPACK and Metal backends.

Model (model.py):
- CudaSDPA: F.scaled_dot_product_attention with repeat_interleave for GQA expansion and boolean attention masks (Triton SDPA requirement)
- StaticKVCache (shared with Metal) for [B,H,S,D] layout with index_copy_
- StandardEncoderRingKVCache/StandardEncoderSDPA for streaming encoder
- _build_causal_mask_bool: 4D boolean mask for Triton compatibility
- Simplified LMAttention.forward to always pass attn_mask (None for XNNPACK)

Export (export_voxtral_rt.py):
- --backend cuda with CudaPartitioner and conv1d_to_conv2d decomposition
- --dtype flag (default fp32, bf16 for CUDA Triton SDPA)
- --qlinear-packing-format / --qlinear-encoder-packing-format for tile_packed_to_4d int4 quantization
- CUDA device placement, Dim.AUTO for audio encoder, .ptd output

Runner (main.cpp, voxtral_realtime_runner.cpp/.h):
- --data_path flag for .ptd delegate data (CUDA compiled kernels)
- Module two-arg constructor for pte+ptd loading

Build (CMakePresets.json, Makefile):
- voxtral-realtime-cuda preset
- make voxtral_realtime-cuda target

CI (.github/workflows/cuda.yml, .ci/scripts/):
- Voxtral Realtime in CUDA CI matrix (int4-tile-packed, offline mode)
- Export/test scripts updated for CUDA quantization args and data path
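The CudaSDPA approach described in the PR summary can be sketched as follows. This is a simplified illustration under assumed shapes, not the PR's exact code: KV heads are expanded with `repeat_interleave` to match query heads (GQA), and the attention mask is a 4D boolean tensor (True = attend), which the summary notes is a Triton SDPA requirement.

```python
import torch
import torch.nn.functional as F

def cuda_sdpa(q, k, v, mask, n_rep):
    # GQA expansion: replicate each KV head n_rep times along the head dim.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    # Boolean attn_mask: True means the position may be attended to.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Illustrative shapes: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)   # [B, H_q, S, D]
k = torch.randn(1, 2, 16, 64)   # [B, H_kv, S, D]
v = torch.randn(1, 2, 16, 64)
# 4D boolean causal mask, in the spirit of _build_causal_mask_bool.
mask = torch.tril(torch.ones(16, 16, dtype=torch.bool)).view(1, 1, 16, 16)
out = cuda_sdpa(q, k, v, mask, n_rep=4)
```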
Force-pushed 05f0ed2 to e5c3690
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
@@ -474,15 +552,22 @@ def main():

```python
os.makedirs(args.output_dir, exist_ok=True)

# Load model
model_dtype = {"fp32": torch.float32, "bf16": torch.bfloat16}[args.dtype]

print("Loading model...")
model = load_model(
    args.model_path,
    max_seq_len=args.max_seq_len,
    n_delay_tokens=args.delay_tokens,
    dtype=model_dtype,
    backend=args.backend,
)

# Move to CUDA for CUDA backend export (AOTInductor needs CUDA tensors)
if args.backend == "cuda":
    print("Moving model to CUDA...")
    model.cuda()
```
For `--backend cuda`, leaving `--dtype` at the current default (fp32) is likely to produce an exported model that fails at runtime/compile time once SDPA is replaced by the CUDA Triton `triton::sdpa` op, which currently enforces bfloat16 inputs. Consider either (a) making bf16 the default when `--backend cuda`, (b) erroring out if `--backend cuda` and `--dtype fp32`, or (c) automatically setting a CUDA compile spec (e.g., `triton_kernel_mode=OFF`) when exporting fp32 so SDPA falls back to a non-Triton implementation.
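Options (a) and (b) above could be combined in argument handling along these lines. A hedged sketch only: the flag names match the PR's `--backend`/`--dtype`, but the default-resolution logic is an assumption, not the PR's code.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--backend", choices=["xnnpack", "metal", "cuda"], default="xnnpack")
# Default of None lets us pick a backend-appropriate dtype afterwards.
parser.add_argument("--dtype", choices=["fp32", "bf16"], default=None)
args = parser.parse_args(["--backend", "cuda"])

if args.dtype is None:
    # Option (a): bf16 by default for CUDA, fp32 elsewhere.
    args.dtype = "bf16" if args.backend == "cuda" else "fp32"
elif args.backend == "cuda" and args.dtype == "fp32":
    # Option (b): reject a combination known to fail under Triton SDPA.
    parser.error("--backend cuda requires --dtype bf16 (Triton SDPA enforces bfloat16)")
```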
```shell
# Add CUDA data path if present
if [ "$DEVICE" = "cuda" ] && [ -f "${MODEL_DIR}/aoti_cuda_blob.ptd" ]; then
  RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
fi
```
This block appends --data_path ... for CUDA, but the script already adds --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd for all non-llama runners earlier (before the model-specific case). For Voxtral Realtime on CUDA this results in duplicate --data_path arguments. Please remove this per-model addition (or refactor the earlier common CUDA handling to avoid double-appending for voxtral_realtime).
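One way to make the per-model branch safe regardless of what the common handling did is an idempotent append. A sketch only, with placeholder values for `RUNNER_ARGS` and `MODEL_DIR`; the refactor the reviewer suggests (removing the per-model block entirely) is the simpler fix:

```shell
RUNNER_ARGS="--model voxtral_realtime"
MODEL_DIR="/tmp/model"

add_data_path() {
  case " $RUNNER_ARGS " in
    *" --data_path "*) ;;  # already set earlier in the script, do nothing
    *) RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd" ;;
  esac
}

add_data_path
add_data_path   # second call is a no-op, so no duplicate flag
echo "$RUNNER_ARGS"
```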
Suggested change:
```diff
-# Add CUDA data path if present
-if [ "$DEVICE" = "cuda" ] && [ -f "${MODEL_DIR}/aoti_cuda_blob.ptd" ]; then
-  RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
-fi
```